Preface

This note describes and evaluates an attempt to analyze the pictorial
structure of Chinese characters. The approach underlying this analysis
is borrowed from the field of linguistics. The collection of objects
under study is the set of all well-formed Chinese characters; that is,
the set of all character-like structures which an informant accepts as
actually occurring or possibly occurring. The set of these well-formed
Chinese characters is considered to be a language. The corpus on which
the grammar is based is Mathews' (3) (see Bibliography, page 118) minus
a handful of entries which are deliberately excluded. 1/

The task of analyzing this language is taken to be that of constructing
a grammar to account for the pictorial structure of characters by
generating objects which correspond to characters. These objects are
well-formed at a high (syntactic) level of structure only, and are
properly viewed as requiring further processing by an output transducer
which is yet to be constructed. This transducer will contain rules for
size and shape variation in character sub-parts and rules for precise
spatial placement of character sub-parts.


1/ More precisely, our corpus consists of hand-written copies of the
characters in Mathews'. See Appendix A for a listing and classification
of the excluded entries.

The grammar 1/ displayed in this note is a mixture of various
formalisms which in turn are inspired by formalisms used in the fields
of linguistics and computation. Although it is a mixture, the central
property of the grammar is its capability of imposing a kind of phrase
structure on its output characters. 2/


1/ The grammar, GCC-3 (Grammar for Component Combination, number three),
is a refinement and extension of the concepts developed in two previously
constructed grammars: GCC-1, in Rankin, Sillars, and Hsu (5) and GCC-2,
in Rankin (4).

2/ See Chomsky (2), Chapter 4 for a discussion of phrase structure
grammars for natural languages.
Phrase structure grammars are well-suited for describing natural
languages or sub-languages whose sentences show an internal structure
in terms of hierarchical combinations of smaller units (e.g., words or
morphemes) into larger units (e.g., phrases and ultimately sentences).
Phrase structure grammars are also well-suited for describing such
artificial languages as some of the logical calculi and some of the
programming languages.

It is now suggested that modified phrase structure grammars can well
describe languages other than the one-dimensional languages just
mentioned. In particular, modified phrase structure grammars are
well-suited for describing the two-dimensional language of Chinese
characters.

The following discussion will be in three parts. First, there is a
section dealing with the description of the various phenomena and
processes in the language of Chinese characters. Second, there is a
section in which the grammar constructed to account for this language
is characterized and discussed. Finally, there is a section dealing with
the evaluation of this grammar in terms of criteria originally proposed
for evaluating grammars for natural languages.
ABSTRACT

A linguistic analysis of one aspect of the structure of Chinese
characters is presented. The analysis is an extension into two
dimensions of a general approach to one-dimensional language study.
Results of the analysis are in the form of a three-level generative
grammar. The first level formalizes restrictions governing the general
complexity of well-formed Chinese characters; the second level
formalizes co-occurrence constraints among character components and the
particular spatial arrangement of these components in classes of
characters; the third level constitutes a procedure for selecting actual
components from a lexicon. Finally, an evaluation of the grammar is
presented, in terms of criteria used in evaluating natural language
grammars.
I.   THE LANGUAGE
A.   INTRODUCTORY

This discussion of the language is based on two fundamental notions:
component and frame. Chinese characters may be viewed as occupying a
hypothetical square; 1/ they may also be viewed as being composed of
recurring sub-parts, here called components. The segmentation of
characters into components segments the square in a variety of ways.
These segmentations of the square are here called frames. Frames can
thus be viewed as abstract representations of classes of characters.

With the notions of component and frame now introduced, the notion of
positioning of components with respect to frames can be discussed. The
following position classes are necessary in a description of Chinese
character component combination: NORTH, SOUTH, WEST, EAST, BORDER,
INTERIOR, and FREE. Thus, the character [character] is segmented as
follows: [figure], and is represented by the frame [frame]; the
component [component] is a WEST and the component [component] is an
EAST. The character [character] is segmented as follows: [figure], and
is represented by the frame [frame]; the component [component] is a
NORTH and the component [component] is a SOUTH. The character
[character] is segmented as follows: [figure], and is represented by the
frame [frame]; the component [component] is a BORDER and the component
[component] is an INTERIOR. (BORDERS may occupy two, three, or four
sides of the square. Other examples are [component], [component], and
[component].) Thus, a frame for any two-component character consists of
two sub-frames of equal area. The only such frames are [frame], [frame],
and [frame]. The single-component character [character] is represented
by the frame [frame] and the component [component] is a FREE.


1/ This is, in fact, a part of the traditional native Chinese view of
Chinese characters.

Based on the discussion so far, in particular the notions of frame and
position class, we can represent the above four characters as:
[frame: W, E] for [character], [frame: N, S] for [character],
[frame: B, I] for [character], and [frame: F] for [character], where the
symbols in the frame are obvious abbreviations for the position class
names. In fact, the grammar provides further information for W, E, N,
and S in the form of subscripts on these symbols and these will be
discussed just below.


There are, to be sure, characters of greater complexity than those
mentioned above. For example, there is the character [character] whose
frame can be viewed as being derived from simple frames by means of a
process of frame-embedding. [character] is represented by the frame
[frame], which is derived by means of the embedding of [frame] in the
EAST part of [frame]. Note that here embedding results in two sub-frames
of equal area, the sum of their areas being equal to the area of the
sub-frame not embedded in. Frames for other complex characters are
derived by means of similar embeddings. 1/
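
The area relation just described can be illustrated with a small sketch.
It assumes, purely for illustration, that a frame is a nested structure
in which every split divides its region into two sub-regions of equal
area. The names Leaf, Split, and areas are hypothetical and are not part
of the grammar, and the embedded frame shown (a NORTH/SOUTH frame in the
EAST part of a WEST/EAST frame) is an invented example.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Union


@dataclass(frozen=True)
class Leaf:
    """A sub-frame holding one position-class label, e.g. 'W' or 'E'."""
    label: str


@dataclass(frozen=True)
class Split:
    """A region divided into two sub-regions of equal area (assumption)."""
    first: "Frame"
    second: "Frame"


Frame = Union[Leaf, Split]


def areas(frame, total=Fraction(1)):
    """Return (label, area) pairs for every sub-frame, in order."""
    if isinstance(frame, Leaf):
        return [(frame.label, total)]
    half = total / 2
    return areas(frame.first, half) + areas(frame.second, half)


# A simple WEST/EAST frame, and the same frame after a NORTH/SOUTH
# frame has been embedded in its EAST part (hypothetical example).
simple = Split(Leaf("W"), Leaf("E"))
embedded = Split(Leaf("W"), Split(Leaf("N"), Leaf("S")))

print(areas(simple))
print(areas(embedded))
```

In the embedded frame the two new sub-frames each receive a quarter of
the square, and together they equal the half occupied by the sub-frame
that was not embedded in.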


A final word on positioning. Most components occur in many but not all
positions. Only a few occur in only one position. As was seen above, the
component [component] is a WEST and an INTERIOR; it is also an EAST, a
NORTH, a SOUTH, and a FREE. The component [component] is a WEST and an
EAST, and nothing else. The component [component] is a BORDER only.


1/ Actually, any frame except [frame] is derived via frame-embedding.
See page 11, ff.


The broad outline of the phrase structure of Chinese characters arises
out of this classification of components into position classes. However,
if a grammar were constructed to represent this and only this
classification, it would be inadequate because it would generate a great
deal of unacceptable output. Further inspection of the co-occurrence
properties of components has prompted us to sub-classify each position
class according to the dimensions of strength and adjunctiveness.


Before discussing these two co-occurrence properties, we will have to
introduce the concept of "in construction with". Two components are in
construction with each other if they occur in sub-frames of equal area.
For example, in [character], [component] and [component] are in
construction with each other. The frame for [character], which is
[frame], contains two sub-frames of equal area, one in which [component]
occurs and one in which [component] occurs. A component is in
construction with a complex character sub-part if it occupies a
sub-frame whose area is equal to the sum of the areas of the sub-frames
occupied by the components in that complex sub-part. For example,
[character] has the frame [frame]. [component] is in construction with
the complex sub-part [sub-part]. [component] occupies the shaded
sub-frame: [frame], whose area is equal to the sum of the areas of the
sub-frames occupied by [component] and [component], namely, the shaded
sub-frames: [frame].

We are now ready to introduce the concept of strength which was
mentioned just above. A component is strong in a particular position if
it can be in construction with many single components and with many
complex character sub-parts while in that particular position. It is
perhaps the case that many of the strong components in our lexicon are
so limited in terms of the number of components with which they can
occur that they do not fit this definition, under any reasonable
interpretation of the term "many" (even though the notion "can be in
construction with" is not intended to be corpus restricted). There
seems, in fact, to be a scale of strength, only the outlines of which
are understood currently. 1/ At the present time we therefore consider
a component to be either strong or not strong with respect to a
particular position. For example, [component] is strong in WEST because
it can be in construction with [component] (in [character]),
[component] (in [character]), and with many other single components;
and with [sub-part] (in [character]), [sub-part] (in [character]), and
with many other complex character sub-parts. Similarly, [component] is a
strong SOUTH, [component] is a strong EAST, and [component] is a strong
NORTH. All BORDERS are strong. INTERIORS (since no INTERIOR can occur
with many BORDERS) and FREES (since strength is a property of components
in construction) are not strong. It is typical that a component which
may occur in several positions is strong in one or more positions and
not strong in others. A good example is [component]. It is strong in
SOUTH and not strong in EAST, WEST, and NORTH.


1/ See page 55 for a discussion of this problem. Much more informant
work is needed before this limitation can be corrected.


A property of certain strong components is adjunctiveness. 1/ A strong
component is either an adjunct or it is not, regardless of position.
That is, an adjunct is an adjunct wherever it occurs. An adjunct is a
strong component which cannot occur in FREE. Examples of adjuncts are
[component] and [component] in WEST, [component] and [component] in
EAST, [component] and [component] in SOUTH, and [component] and
[component] in NORTH. Adjuncts have the further property that they may
never be in construction with adjuncts. Thus the structure [structure]
is not an acceptable character since both [component] and [component]
are adjuncts. In fact, it was in order to avoid generating such
unacceptable structures that the dimension of adjunctiveness was
established.


1/ Rankin (4) recognized both strong and non-strong adjuncts (p. 32).
Further informant work has prompted us to reassign non-strong adjuncts
either to the class of strong adjuncts or to the class of FREE
components. This solution, incidentally, seems to be not entirely
satisfactory.


Since they must be strong components, adjuncts occur only in WEST, EAST,
NORTH, SOUTH, and BORDER. Among the strong components in each of these
position classes, some are adjuncts and some are non-adjuncts.


Based on the discussion so far, we can give the grammar's full
representation of the four sample characters given on pages 2 and 3. For
[character], there is [frame: W_sf, E]; for [character], there is
[frame: N_sf̄, S_f]; for [character], there is [frame: B, I]; and for
[character], there is [frame: F].

W_sf determines a WEST component which is strong (by subscript s) and
non-adjunctive (by subscript f, indicating that the component can occur
in FREE). E determines any EAST component by lack of subscript, for any
EAST can be in construction with a strong non-adjunctive WEST. N_sf̄
determines a strong (by s) adjunctive (by f̄) NORTH. S_f determines a
non-adjunctive (by f) strong or non-strong (by lack of s) SOUTH. (Note
how the constraints on the occurrence of strong components and adjuncts
are governed by these subscripts.) B, I, and F are subscriptless in the
grammar, B always being strong, I and F being insensitive to the
strength dimension, but both I and F being non-adjunctive.
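
The way subscripts restrict which components may fill a sub-frame can be
sketched as follows. This is only an illustration under stated
assumptions: the classes Component and Constituent, their field names,
and the sample lexicon entries are hypothetical, and an absent subscript
is modelled as "no constraint".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Component:
    """How one lexicon entry behaves in one position (hypothetical)."""
    name: str
    position: str   # 'W', 'E', 'N', 'S', 'B', 'I' or 'F'
    strong: bool
    adjunct: bool   # an adjunct is a strong component barred from FREE


@dataclass(frozen=True)
class Constituent:
    """A frame symbol such as W_sf; missing subscripts impose nothing."""
    position: str
    requires_strong: bool = False     # subscript s present
    adjunct: Optional[bool] = None    # f -> False, f-bar -> True, none -> None

    def admits(self, c: Component) -> bool:
        """True if component c may fill a sub-frame marked with this symbol."""
        if c.position != self.position:
            return False
        if self.requires_strong and not c.strong:
            return False
        if self.adjunct is not None and c.adjunct != self.adjunct:
            return False
        return True


# The four subscripted symbols discussed in the text.
W_sf = Constituent("W", requires_strong=True, adjunct=False)
E = Constituent("E")
N_sfbar = Constituent("N", requires_strong=True, adjunct=True)
S_f = Constituent("S", adjunct=False)

# Invented lexicon entries, used only to exercise the check.
west_strong_free = Component("c1", "W", strong=True, adjunct=False)
west_adjunct = Component("c2", "W", strong=True, adjunct=True)
south_weak = Component("c3", "S", strong=False, adjunct=False)

print(W_sf.admits(west_strong_free), W_sf.admits(west_adjunct))
print(S_f.admits(south_weak))
```

Treating the strength subscript as a one-way requirement mirrors the
text: the presence of s demands a strong component, while its absence
leaves strength open.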


One of the processes involved in the formation of Chinese characters is
that of component repetition. We here recognize four 1/ types of
repetition: horizontal continuous, as in [character]; vertical, as in
[character]; triangular, as in [character]; and horizontal
discontinuous, as in [character]. With the exception of the horizontal
discontinuous cases, all 2/ instances of repetition function in the
grammar as non-strong, non-adjuncts in WEST, EAST, NORTH, SOUTH,
INTERIOR, and FREE. The horizontal discontinuous cases function as
strong adjuncts in BORDER and do not occur in any other class.


1/ For a treatment of repetition which recognizes the existence of more
than four types and which is non-generative in nature see Rankin (4).
In a generative treatment, such as we are presenting here, we feel
that only these four should be recognized.

2/ There are a few exceptions to this observation. Some structures,
like [structure] and [structure], could be generated by repetition
rules, but are not, because they are strong components and it is
desirable that strong components be listed in the lexicon as single
components. This inclusion of complex structures in the lexicon is one
of the limitations in the current treatment.


These instances of repetition are represented in the grammar as
follows: [character] as [frame] (horizontal continuous), [character] as
[frame] (vertical), [character] as [frame] (triangular), and
[character] as [frame] (horizontal discontinuous).


B.   DETAILS


What follows now is a rather more detailed account of some of the
phenomena and processes that have just been introduced.

The process of frame-embedding has been mentioned briefly. In this
detailed treatment, we will use the notions of host-frames,
occupant-frames, and product-frames.

Further, we will highlight the process of frame-embedding by ignoring
all marks which in the grammar may occur in frames or sub-frames,
except for the mark H (for "host"), which is the recursive element in
the grammar that activates the process of frame-embedding. We will refer
to the sub-frame containing H as the "H sub-frame".


We will use the formula:

Occupant-frame + Host-frame -> Product-frame

which is read "Embed the occupant-frame in the host-frame to produce
the product-frame". The host-frame always has H in one of its sub-frames
(or in its only sub-frame in [frame: H]). This one sub-frame may be
thought of as the hypothetical square or size and shape transformations
of it. The occupant-frame (always of the form: [frames]) may or may not
have H in one of its sub-frames. The product-frame of any embedding will
have H in one of its sub-frames if the occupant-frame that produced it
has an H sub-frame. The product-frame of the first embedding is always
identical to the occupant-frame of the first embedding.


Sample Single Embeddings:

Occupant-frame + Host-frame -> Product-frame

[figure: sample single embeddings]
In general, H may occur in any sub-frame except the border sub-frame,
since border is not a possible transformation of the hypothetical
square. (For purposes of this discussion, "border" sub-frames can be
filled by components from both the BORDER position class and the
horizontal discontinuous repetition class.) Thus there are no complex
borders. 1/

If a product-frame contains an H sub-frame, that product-frame is
then redefined as a host-frame. This redefinition is shown in the
following example.


1/ Actually there are some BORDERS which appear to be complex.
Examples are [component] and [component]. Complexity in BORDER is
extremely rare and we circumvent the problem by listing such instances
as the two above as single components in the lexicon. Further, all
instances of horizontal discontinuous repetition appear to subdivide the
border sub-frame. However, they do conform to our general definition of
BORDER in that they occupy two sides of the hypothetical square (see
page 3).

Sample first embedding followed by second embedding:

Occupant-frame + Host-frame -> Product-frame

[figure: a first embedding, the redefinition of its product-frame as a
host-frame, and a second embedding]

Note that in the first embedding one of the sub-frames of the
occupant-frame contains H. It is this occurrence of H that both allows
the second embedding to take place and also marks the place where the
second is to take place. If there were no H sub-frame in the
occupant-frame, no second embedding could take place.

Further, no occupant-frame in the grammar may contain more than one H
sub-frame. This restriction imposes a "blocking" of embedding. Once an
occupant-frame has been embedded in a host-frame in any embedding, the
sub-frame which is equal in area to the H sub-frame is permanently
blocked from embedding. Another way of saying this is that the grammar
does not allow any host-frame to contain more than one H. In the
following sample products the shaded sub-frames are blocked from
embedding:


[figure: sample product-frames with the blocked sub-frames shaded]
There are further constraints on the process of frame-embedding.
First, there is an upper limit of four embeddings in the generation of
any character, giving rise to a maximum of five sub-frames per frame.
Second, there may be only two BORDERS in the generation of a frame for
any character. (With respect to this constraint, cases of horizontal
discontinuous repetition are treated as BORDERS.) Finally, there is a
limit of four sub-frames per frame if any sub-frame is occupied by a
case of horizontal continuous, vertical, or triangular repetition. This
is so because these cases of repetition are in a sense
multi-componential. In contrast, cases of horizontal discontinuous
repetition function as single-component BORDERS.

Blocking of embedding is closely related to the functioning of strong
components in the following sense. A strong component must occur in
every blocked sub-frame of a frame. Further, a terminal frame contains
within it two sub-frames of equal area (shaded in the following
examples: [frame] and [frame]). A strong component must occur in at
least one of these two sub-frames of equal area. It follows that an
output character contains at most one non-strong component.
With respect to adjuncts, they may occur wherever any strong component
may occur, with one exception. There are no cases of adjuncts being in
construction with adjuncts. So, not both of the sub-frames of equal area
may be occupied by adjuncts. That is, for example, not both shaded
sub-frames in any one of the following frames may be occupied by
adjuncts: [frames]. It follows that in any output character there must
be at least one non-adjunct.

Finally, not both sub-frames of equal area in any frame may be occupied
by the same component. If it were otherwise, such output characters as
[character] would be ambiguous (or derivable in more than one way). One
derivation of [character] would be an EAST plus a WEST with [component]
filling both positions; the other derivation would be the correct one:
that [character] is a case of horizontal continuous repetition.
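
The three constraints on the two equal-area sub-frames of a terminal
frame (at least one strong component, not two adjuncts, not the same
component twice) can be collected into a single check. The sketch below
is illustrative only; the Filler type, its fields, and the sample
fillers are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filler:
    """A component as it behaves in the position it fills (hypothetical)."""
    name: str
    strong: bool
    adjunct: bool


def pair_is_allowed(a: Filler, b: Filler) -> bool:
    """Check two fillers occupying the equal-area sub-frames of a terminal frame."""
    at_least_one_strong = a.strong or b.strong
    not_both_adjuncts = not (a.adjunct and b.adjunct)
    # Identical fillers are excluded; such cases belong to repetition.
    distinct = a.name != b.name
    return at_least_one_strong and not_both_adjuncts and distinct


# Invented fillers used only to exercise the check.
strong_free = Filler("c1", strong=True, adjunct=False)
adjunct_1 = Filler("c2", strong=True, adjunct=True)
adjunct_2 = Filler("c3", strong=True, adjunct=True)
weak_1 = Filler("c4", strong=False, adjunct=False)
weak_2 = Filler("c5", strong=False, adjunct=False)

print(pair_is_allowed(strong_free, weak_1))     # strong plus non-strong
print(pair_is_allowed(adjunct_1, adjunct_2))    # two adjuncts
print(pair_is_allowed(weak_1, weak_2))          # no strong component
print(pair_is_allowed(strong_free, strong_free))  # same component twice
```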


II.   THE GRAMMAR


A.   INTRODUCTORY


The grammar is set up to generate output characters in three stages.
The grammar also imposes a hierarchical classification on the set of all
output characters, and this can best be seen by first discussing the
three stages of the generative process.

Stage one is a string of symbols generated by a finite state diagram.
This string controls the number and type of embeddings in the output
character. 1/ Stage two is a two-dimensional constituent array
(generated via the process of frame embedding) which determines the
two-dimensional phrase structure of the output character and governs
the positioning, strength, and adjunctiveness constraints on the
components to be selected. Stage three is the output character itself: a
frame filled with components selected (via a lexical look-up procedure)
from the lexicon.

The hierarchical classification is as follows. Any stage 1 string
determines a number of constituent arrays and any stage 2 frame
determines a number of output characters. Thus, the set of all strings
permitted by the finite state diagram ultimately determines the set of
all output characters. The hierarchy referred to can be depicted as a
tree structure, as follows:


1/ The various constraints and phenomena discussed here in terms of
the grammar have already, of course, been discussed in Section I, to
which the reader is now referred.


[figure: tree descending from the State Diagram through the levels
String, Constituent array, and Output character]

The path from KM to [constituent array] to the output character
[character] gives the derivation of this output character in terms of
how the grammar generates it. The dashed line between nodes in the tree
indicates that in general there are many (though finitely many) other
nodes at the same stage dominated by the same node of the previous
stage. 1/


1/ Or in the case of Stage 1, other nodes dominated or generated by
the state diagram.


B.   DETAILS


To produce Stage 1:

Stage 1 is an output string from a finite state diagram, which consists
of a number of states with input and output transitions. One state is
distinguished by having only output transitions, and this is the initial
state. One state is distinguished by having only input transitions, and
this is the final state. Along the transition arrows there are either
one or two output symbols (K, G, H, H', M, M'). 1/

[figure: finite state diagram]


1/ These symbols are explained on pages 21 and 22.
The initial state is entered and a process of random transition from
the initial state to the final state (perhaps through intermediate
states) is begun. At each transition from state to state one output
symbol is produced. If there is only one symbol along the transition
arrow, that symbol is necessarily chosen. If there are two symbols
along the transition arrow, one is chosen at random. The next
transition causes a symbol to be produced to the right of the
last-produced symbol, and so on. As the final state is entered, the last
(rightmost) symbol is produced, and the process terminates, producing
the string which is Stage 1: x_1 x_2 ... x_{n-1} x_n. 1/


1/ Since n cannot exceed 5, it might seem inappropriate to set up the
general case for this and all the other stages and processes in the
grammar. However, by so doing, we have achieved ease of manipulation
and generality for such future applications as the study of
frame-embedding as an isolated process.
Sample output strings of the form x_1 x_2 ... x_{n-1} x_n, generated
by the state diagram, are: KM' (by taking the topmost path through the
diagram), G (by taking the bottom-most path through the diagram), KHM
and KHHM (by taking two of the middle paths through the diagram).

The nature of any string generated by the state diagram determines
the size and complexity of the final output character. For example, no
string may exceed length five, and this restriction guarantees that no
output character shall have more than five components.

Each symbol in the string represents a set of frames (see the
correspondence table on pages 23 and 24). The frames of each set have
certain properties in common, related to the embedding process, and to
the border-non-border distinction. Border here includes not only the
position class BORDER but also the discontinuous repetition cases,
which are represented by the same frame configuration as the BORDER
position class and which function in the grammar subject to the same
restrictions.

K = [frame: H] is the one frame which is a host only. That is, it is
the initial step with respect to frame embedding.

G = [frame] is the one case of non-embedding. It is neither a host nor
an occupant.

H = {[frames]} and H' = {[frames]} are the sets of frames which
function as both hosts and occupants. They are the intermediate steps
with respect to frame embedding. They are distinct sets in that H is
non-border and H' is border.

M = {[frames]} and M' = {[frames]} are the sets of frames which
function as occupants only. They are the terminal steps with respect to
frame embedding. They are distinct in that M is non-border and M' is
border.

Finally, the state diagram (by distinguishing H from H' and M from M'
and by prohibiting certain transitions) guarantees that no output string
may contain more than two instances of "symbol-prime". Thus it
guarantees that no output character may contain more than two
borders.
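
The Stage 1 procedure and the restrictions just stated can be sketched
as a random walk. The state diagram itself is not reproduced here, so
the sketch below rests on assumptions: that every string is either G
alone or K followed by up to three symbols from H and H' and a final M
or M', that the only further restrictions are the length limit of five
and the limit of two primed symbols, and that all choices are equally
likely apart from an arbitrary probability for the G path. The function
names are hypothetical.

```python
import random
from typing import List

MAX_LENGTH = 5   # no string may exceed length five
MAX_PRIMES = 2   # no string may contain more than two primed symbols


def random_stage1_string(rng: random.Random) -> List[str]:
    """Walk an assumed state diagram and return the symbols produced."""
    if rng.random() < 0.2:               # arbitrary weight for the G path
        return ["G"]
    symbols = ["K"]
    primes = 0
    while True:
        # Symbols available on the transitions leaving the current state.
        options = ["M"]
        if primes < MAX_PRIMES:
            options.append("M'")
        if len(symbols) < MAX_LENGTH - 1:    # leave room for the final symbol
            options.append("H")
            if primes < MAX_PRIMES:
                options.append("H'")
        symbol = rng.choice(options)
        symbols.append(symbol)
        primes += symbol.endswith("'")
        if symbol in ("M", "M'"):            # final state entered
            return symbols


def is_valid(symbols: List[str]) -> bool:
    """Check a string against the restrictions assumed above."""
    if symbols == ["G"]:
        return True
    if not 2 <= len(symbols) <= MAX_LENGTH:
        return False
    if symbols[0] != "K" or symbols[-1] not in ("M", "M'"):
        return False
    if any(s not in ("H", "H'") for s in symbols[1:-1]):
        return False
    return sum(s.endswith("'") for s in symbols) <= MAX_PRIMES


rng = random.Random(0)
for _ in range(5):
    print("".join(random_stage1_string(rng)))
```

Under these assumptions the sample strings KM', G, KHM, and KHHM given
in the text all pass the check, while a string with three primed symbols
or more than five symbols does not.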
To produce Stage 2:

And now, before discussing the process that generates Stage 2, we will
display the output symbol-to-frame correspondence table referred to
above. Each output symbol of the state diagram corresponds to a set of
frames. These output symbols appear as column headings. The filled
frames are listed underneath.

Table of Correspondences Between Output Symbols from Stage 1 and Frames:

[table: one column per output symbol (K, G, H, H', M, M'), with the
filled frames of the corresponding set listed underneath]
There are two steps in the process that generates a Stage 2
constituent array from a Stage 1 string. First, a sequence of frames
is created. Second, if there is more than one frame in the sequence,
these frames are compiled into a constituent array.

First, each symbol in any string of Stage 1 corresponds to a column in
the table of correspondences. To obtain the sequence of frames, select
at random one frame from each column indicated by the Stage 1 string.
Thus a sequence of frames is created: E_1 ... E_n. The frames of the
sequence E_1 ... E_n consist of sub-frames which contain either or both
of two types of marks: (1) the mark H (for Host), which sets up the
process of frame embedding, and (2) the marks called constituents, which
will later be used in the process of component selection. At this stage
in the process, since we are concerned with frame-embedding, only the
frame configurations and the mark H will be relevant to our discussion.

Second, when there is more than one frame in the sequence, the
process of frame-embedding is initiated so as to compile the sequence
of frames into a constituent array.

To  produce  the  constituent  array,  apply  the  frame- embedding  formula: 

Occupant-frame  +  Host-frame  -»  Product-frame 

to  the  sequence  of  frames: 

E   E„  ....  E  1  ^  n  ^   5 

12        n 

The frame-embedding process may be viewed as a recursive process
initiated by E_1, which is the hypothetical square, or host-frame of
the first embedding. Hence we will set E_1 = H_1, with H_1 denoting the
first host-frame. The subsequence E_2 ... E_n is the set of
occupant-frames which are called upon during the embedding process, one
occupant-frame being necessary for each embedding. Note that where there
is only one frame in the sequence, it is not possible to define either
host or occupant frames.

Embed occupant-frame E_2 in host-frame H_1 to obtain product-
frame H_2. Redefine H_2 as the host-frame of the second embedding. Embed
E_3 in H_2 to obtain product-frame H_3. Continuing in this way, embed
occupant-frame E_i in host-frame H_(i-1) to obtain frame H_i, etc., until
E_n is embedded in H_(n-1) to obtain the final product-frame H_n, which is
defined to be the constituent array.

A distinction should be made between H and H_i, i = 1, 2, ..., n.
H_i is a well-formed frame with an H in one sub-frame. H is a mark
which indicates the precise location of embedding. H_1 = E_1 is the
hypothetical square with an H inside. Size and shape transformations
of this hypothetical square are created automatically as we go through
the recursive process. In other words, the sub-frame containing an H,
of a product-frame H_i, is considered to be the transformed
hypothetical square.

H, which is present in one sub-frame of each E_i except E_n and each
H_i except H_n (the constituent array), acts as the indicator of the
sub-frame in which the next embedding is to occur. In this way H is the
activator of the frame-embedding process. When an H fails to appear in
a product-frame after an embedding has taken place, that product-frame
cannot be redefined as a host-frame, and the process stops. That final
product-frame where H fails to appear is, of course, H_n, the
constituent array.


This discussion of the recursive process (embedding an occupant-frame
in a host-frame to produce a product-frame, redefining the product-frame
of the previous embedding as host-frame, and embedding again, until the
constituent array H_n is obtained) may be expressed in compact form
as the following sequence of formulas, where + is the embedding
operation:

    E_2 + H_1     -> H_2
    E_3 + H_2     -> H_3
       ...
    E_i + H_(i-1) -> H_i
       ...
    E_n + H_(n-1) -> H_n,    1 <= n <= 5


For  each  occupant-frame  +  host-frame  embedding  a  product-frame  is 
produced.   This  product-frame  is  then  the  host-frame  of  the  next 
formula  in  the  sequence.   The  recursive  nature  of  the  embedding  process 
is  evidenced  by  the  generation  of  a  sequence  of  formulas  in  which  there 
is a successive production of product-frames and redefinition as
host-frames. The sequence ends with the production of the final
product-frame H_n, the constituent array.
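
The recursion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the grammar: a frame is modelled as either the bare mark "H" (the hypothetical square) or a tuple whose first element is a layout tag ("WE" for side-by-side, "NS" for one above the other) and whose remaining elements are sub-frames or constituent marks. The tag names and the functions has_host, embed, and constituent_array are hypothetical.

```python
HOST = "H"  # the host mark; representation of frames is an assumption


def has_host(frame):
    """True if the frame, or any nested sub-frame, carries the mark H."""
    if frame == HOST:
        return True
    return isinstance(frame, tuple) and any(has_host(sub) for sub in frame)


def embed(occupant, host):
    """E_i + H_(i-1) -> H_i: put the occupant where the host has its H."""
    if host == HOST:
        return occupant
    if isinstance(host, tuple):
        return tuple(embed(occupant, sub) for sub in host)
    return host  # a layout tag or constituent mark is left unchanged


def constituent_array(frames):
    """Fold the sequence E_1 ... E_n into the final product-frame H_n."""
    product = frames[0]  # H_1 = E_1
    for occupant in frames[1:]:
        if not has_host(product):
            raise ValueError("no H left to embed into")
        product = embed(occupant, product)
    return product


# Frames assumed for the string K H M (layout tags are illustrative).
frames = ["H", ("WE", "Ws", "H"), ("NS", "Nsf", "Ss")]
result = constituent_array(frames)
print(result)
```

The loop stops naturally when the frames run out; the final product-frame carries no H, which matches the stopping condition stated in the text.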

An  example  of  the  generation  of  a  constituent  array  from  an  output 
string  is  now  given: 
String:  K H M

Sequence of frames (here [A | B] stands for two sub-frames side by
side, and [A / B] for one sub-frame above the other):

    E_1 = [ H ]               host frame of 1st embedding, H_1
    E_2 = [ W_s | H ]         occupant frame
    E_3 = [ N_sf / S_s ]      occupant frame

Product frame of 1st embedding (redefined host frame):

    H_2 = [ W_s | H ]

Product frame of 2nd embedding (constituent array):

    H_3 = [ W_s | [ N_sf / S_s ] ]
The string KHM gives the sequence of frames E_1, E_2, E_3, and
these frames then enter into the embedding process:

    Occupant-frame + Host-frame -> Product-frame
         E_i       +  H_(i-1)   ->     H_i

The process is repeated in this case until E_3 is embedded in H_2 to
produce H_3, the constituent array.

To produce Stage 3:

We will now describe the process which generates an output character
from a Stage 2 constituent array. The process selects a component
in the lexicon for each constituent which occurs in the constituent
array of Stage 2. The constituents are handled in order according to
the size of the sub-frames that they occupy, from largest to smallest.
For constituent arrays having two sub-frames of equal size a diagonal
order, from top-left to bottom-right, is followed. 1/ Specifically, the
"north" sub-frame precedes the "south" sub-frame; "west" precedes "east";
and "border" precedes "interior."

1/ Actually, this corresponds to the traditional stroke order. A person
trained in traditional calligraphy writes most Chinese characters
starting at top-left of the hypothetical square, and finishing at
bottom-right.
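
The ordering rule can be illustrated with a short Python sketch. It assumes, purely for illustration, that each constituent comes with the rectangle of its sub-frame inside a unit square, given as (x, y, width, height) with y measured downward from the top. Sorting by decreasing area and then by x + y is one possible reading of "largest first, then diagonally from top-left to bottom-right"; the function name processing_order and the coordinates are hypothetical.

```python
def processing_order(constituents):
    """Order constituent marks: largest sub-frame first, then diagonal order.

    constituents: list of (mark, (x, y, width, height)) pairs, where the
    rectangle is the sub-frame occupied by the mark (an assumed encoding).
    """
    def key(item):
        _mark, (x, y, w, h) = item
        # Negative area sorts large sub-frames first; x + y grows along
        # the diagonal from top-left to bottom-right.
        return (-(w * h), x + y)

    return [mark for mark, _ in sorted(constituents, key=key)]


# W_s fills the west half; N_sf and S_s split the east half.
array = [
    ("Ss", (0.5, 0.5, 0.5, 0.5)),
    ("Nsf", (0.5, 0.0, 0.5, 0.5)),
    ("Ws", (0.0, 0.0, 0.5, 1.0)),
]
order = processing_order(array)
print(order)
```

Under this encoding a border sub-frame would be given the rectangle of the whole region it encloses, so that it precedes its interior by size.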

These constituents are of the form Σ_α, where Σ is a symbol
corresponding to one of the position classes, or to one of the
repetition classes, and α is a subscript which gives information on
strength and adjunctiveness. The combinations of Σ_α's allowed
guarantee that each output character has at least one non-adjunct, and
at most one non-strong component (see pages 15 and 16).

For instance, for the Stage 2 constituent array [ W_s | N_sf / S ]
(W_s beside N_sf over S), the constituents are (in order) W_s, N_sf,
and S. In the case of W_s, we have Σ as W, and α is the single
subscript s. For N_sf we have Σ as N, and α is the double subscript sf.
Finally, S is an example of Σ with α as the null subscript.

Next it will be necessary to describe the lexicon, an essential
tool used in the creation of Stage 3.
The lexicon is a table with components 1/ down the side for each
row, symbols across the top for column headings, and tallies at certain
row-column intersection points signifying that the components may occur
in the positions signified by column headings W, E, N, S, B, I, or F, 2/
or may undergo the repetition processes signified by V, C, T, or D. 3/

1/ Each of the components has a unique number associated with it. These
numbers are not a part of the formal apparatus of the grammar. We are
including them for the purpose of making lexical look-up easier for those
readers familiar with Chinese characters. The component numbers are of
the following form: S.T.N, where S is the number of strokes in the
component (S = 1-19); T = 1-8 characterizes the type of the last stroke
(last in the sense of traditional stroke order) of the component: 1 for
horizontal, 2 for vertical, 3 for dot, 4 for V-like, 5 for S-like, 6
for /-like, 7 for hooked, and 8 for multi-directional; and N is the
number of the component in the sub-list.

2/ These column headings are abbreviations of the position classes
discussed in Section I. W=WEST, E=EAST, N=NORTH, S=SOUTH, B=BORDER,
I=INTERIOR, and F=FREE.

3/ These column headings are abbreviations of the repetition processes
discussed in Section I. V=Vertical repetition, C=horizontal
Continuous repetition, T=Triangular repetition, and D=horizontal
Discontinuous repetition.
Sample Lexicon:

(The superscripts on some of the tallies are to be ignored. Their
significance will be explained on page 74.)

[Table not legible in this copy. The rows are the components numbered
3.2.11, 3.3.3, 4.4.4, 6.3.13, 6.8.3, 7.1.5, and 7.3.3; the column
headings are W, E, N, S, B, I, F, V, C, T, and D. The component glyphs
and the tallies are lost.]


In general, the lexicon will be used in the following way. For
any constituent Σ_α we examine the column specified by Σ, where Σ is
any one of the eleven column headings, to determine the set of rows
which have tallies corresponding to the subscript α.

If α is f or includes either f or f̄ (such as in S_f or N_sf),
we must further determine the subset of these rows which also have
tallies in column F. Note that the column F has two functions. First,
it is one of the columns denoted by Σ. Second (as in this part of the
discussion), it is a column to be searched for adjunctiveness
information in conjunction with the search in column Σ.

The  following  chart  summarizes  the  types  of  subscripts  which 
may  occur  in  each  constituent  mark,  and  the  types  of  tallies  which 
match  the  subscripts. 
Subscript*    Tallies activated by α       Explanation
              Column Σ      Column F

∅             s, s̄, x                      The null subscript specifies a
                                           row which has any tally in the
                                           Σ column.

s             s                            The s subscript specifies a row
                                           which has an s tally in Σ.

f             s, s̄          x              The f subscript specifies a row
                                           which has both an x tally in
                                           column F, and either an s tally
                                           or an s̄ tally in column Σ.

s̄f            s̄             x              The s̄f subscript specifies a
                                           row which has both an s̄ tally in
                                           column Σ, and an x tally in
                                           column F.

sf            s             x              The sf subscript specifies a row
                                           which has both an s tally in
                                           column Σ, and an x tally in
                                           column F.

sf̄            s             no x           The sf̄ subscript specifies a row
                                           which has both an s tally in
                                           column Σ, and no x tally in
                                           column F.

* Note that s̄f̄ does not occur. See page 8 footnote.
For each of the first two cases (∅ and s) a set of rows is
determined by searching column Σ. For each of the last four cases
(f, s̄f, sf, and sf̄) a set of rows is determined by searching column Σ,
and further, a subset of this set is determined by searching column F.
At this point, a single row is randomly selected from the set or subset
thus determined, and the component (if the symbol of the constituent
is W, E, N, S, B, I, or F) or repetition of the component (if the symbol
of the constituent is V, C, T, or D) on that row replaces the constituent
being processed. This restriction prevents the class of ambiguous
derivations mentioned on page 16.
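
The search just described can be sketched in Python. This is a hedged illustration rather than the grammar's own formalism: the lexicon is modelled as a dictionary from component number to the tallies on its row, a bar is written with a leading "~" (so "~s" stands for s̄ and "s~f" for sf̄), and the tally values in the toy lexicon are invented for the example. The names LEXICON, RULES, candidate_rows, and select_component are hypothetical.

```python
import random

# Toy lexicon: component number -> {column heading: tally}. The tallies
# here are made up for illustration; they are not the report's data.
LEXICON = {
    "3.2.11": {"W": "s", "F": "x"},
    "3.3.3": {"N": "~s", "F": "x"},
    "4.4.4": {"S": "s"},
    "6.3.13": {"N": "~s"},
    "7.1.5": {"W": "~s", "S": "s", "F": "x"},
}

# Subscript -> (tallies accepted in column Sigma, requirement on column F).
# F requirement: None = F not searched, True = x tally required,
# False = x tally must be absent.
RULES = {
    "": ({"s", "~s", "x"}, None),
    "s": ({"s"}, None),
    "f": ({"s", "~s"}, True),
    "~sf": ({"~s"}, True),
    "sf": ({"s"}, True),
    "s~f": ({"s"}, False),
}


def candidate_rows(lexicon, sigma, alpha):
    """Rows whose tallies match the constituent sigma with subscript alpha."""
    accepted, needs_x = RULES[alpha]
    rows = [r for r, t in lexicon.items() if t.get(sigma) in accepted]
    if needs_x is not None:
        rows = [r for r in rows if (lexicon[r].get("F") == "x") == needs_x]
    return rows


def select_component(lexicon, sigma, alpha, rng=random):
    """Pick one matching row at random, as in the final step of the search."""
    rows = candidate_rows(lexicon, sigma, alpha)
    if not rows:
        raise LookupError(f"no row matches {sigma} with subscript {alpha!r}")
    return rng.choice(rows)
```

For example, candidate_rows(LEXICON, "S", "sf") keeps only the rows with an s tally in column S and an x tally in column F, while the subscript "s~f" keeps the rows with an s tally in column S and no x tally in column F.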


The lexical search procedure can be summarized in the following
way:

    Constituent array
      -> Constituents Σ_α
      -> Column headings Σ and F (lexical search by tallies)
      -> Components c
      -> Output character
An example of the generation of an output character from a
constituent array is now given.

The Stage 2 constituent array with the constituents W_s, N_s̄f, and
S_s is processed to generate an output character in the following
fashion.


First, the constituents are ordered according to the decreasing
size of the sub-frames that they fill. Since there are two terminal
sub-frames of equal size, the order of these last two is determined by
the diagonal line procedure.

Then, the sets or subsets of rows which have tallies corresponding
to each constituent are determined. In other words, column W will
be searched for the set of rows which have tallies corresponding to
subscript s. Column N will be searched for the set of rows which have
tallies corresponding to subscript s̄, and further, we determine the
subset of this set of rows having x tallies in column F. Finally,
column S will be searched for the set of rows which have tallies
corresponding to subscript s. To illustrate what is taking place at
this point, the tallies corresponding to W_s will be circled once in
the sample lexicon, the tallies corresponding to N_s̄f will be circled
twice, and the tallies corresponding to S_s will be circled three times.


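[Sample lexicon with the candidate tallies circled. Rows: 3.2.11,
3.3.3, 4.4.4, 6.3.13, 6.8.3, 7.1.5, 7.3.3. Columns: W, E, N, S, B, I,
F, V, C, T, D. The component glyphs and circled tallies are not legible
in this copy.]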

Then one of the rows in each set or subset is picked at random,
with the constraint that the row selected for N_s̄f may not be the same
as the row selected for S_s, since they are in construction with each
other. Again, one, two and three circles will be used to illustrate
the tallies corresponding to the sets or subsets of tallies.


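[Sample lexicon with the selected tallies circled. Rows: 3.2.11,
3.3.3, 4.4.4, 6.3.13, 6.8.3, 7.1.5, 7.3.3. Columns: W, E, N, S, B, I,
F, V, C, T, D. The component glyphs and circled tallies are not legible
in this copy.]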


Finally, the component which occurs on the row of the tally
just selected replaces Σ_α.
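
The random selection with its constraint can be sketched as follows. The sketch assumes the candidate rows for each constituent have already been determined, and it uses simple rejection sampling (draw all rows independently, retry if two constituents in construction received the same row). Rejection sampling is a choice made for this illustration and is not stated in the text; the name pick_rows and the row labels are hypothetical.

```python
import random


def pick_rows(candidates, in_construction, rng=random, max_tries=1000):
    """Pick one row per constituent, with distinct rows for paired constituents.

    candidates: dict mapping each constituent to its list of candidate rows.
    in_construction: list of (constituent, constituent) pairs that must
    not receive the same row.
    """
    for _ in range(max_tries):
        chosen = {name: rng.choice(rows) for name, rows in candidates.items()}
        if all(chosen[a] != chosen[b] for a, b in in_construction):
            return chosen
    raise LookupError("no admissible selection found")


# Illustrative candidate sets; row "7.1.5" is available to both N and S.
candidates = {
    "W_s": ["3.2.11"],
    "N_~sf": ["3.3.3", "7.1.5"],
    "S_s": ["7.1.5", "4.4.4"],
}
selection = pick_rows(candidates, [("N_~sf", "S_s")], rng=random.Random(0))
```

Each accepted draw is uniform over the admissible combinations, which is one way to read "picked at random, with the constraint".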
This example can also be represented in the following way:

    Constituent array (W_s, N_s̄f, S_s)
      -> Constituents W_s, N_s̄f, S_s
      -> Column headings W, N (with F), S (lexical search)
      -> Components [glyphs not legible in this copy]
      -> Output character [glyph not legible in this copy]
Sample Derivation from Total Grammar:

    String:             K H H M
    Constituent array:  W_s, C, S_s [layout not legible in this copy]
    Output character:   [glyph not legible in this copy]
III.   EVALUATION


A.    GENERAL  REMARKS 

How are grammars for component combination to be evaluated? There
are two criteria which are used here: completeness and tightness.
These criteria are well-known in the literature of natural language
study. Tightness and completeness taken together in fact constitute
the transformationists' "minimal requirements" on natural language
grammars: that they should generate all and only the grammatical
sentences of the languages under study. 1/

1/ See, for example, Bach (1), page 5.

There are, of course, other conditions that might be put on a
successful grammar of component combination: conditions having to do
with simplicity, with the choice of a grammar model, with relative
correctness of structural descriptions assigned by the grammar to
output objects, with integratability of the grammar with other
grammatical processes necessary in the total description of Chinese
characters, with possible applicability of the grammar or grammar model
to other two-dimensional languages, and so on. In this note, however,
we will be concerned only with completeness and tightness.

GCC-3 constitutes an attempt to construct a simple grammar which
is balanced between completeness and tightness. It seems to be nearly
as simple as possible in terms of the size of the lexicon of components
and the number and complexity of grammar rules. It is neither 100 per
cent tight nor 100 per cent complete, but it would seem that any
adjustment to make it tighter would result in less simplicity and/or
completeness, and that any adjustment to make it more complete would
also result in less simplicity and/or tightness.
1.    Completeness 

A complete grammar for the language of Chinese characters is one
which generates all of the well-formed characters in the language.
This is an extremely difficult condition to satisfy, if only because
it is so difficult to determine in many cases exactly what is and what
is not in the language. 1/ The problem here is probably no worse than
the problem in natural language study of deciding for border-line
strings whether or not they are grammatical sentences of the language
under study.

A  less  demanding  condition  is  that  a  successful  grammar  generate 
all  the  well-formed  characters  in  some  target  corpus. 

Such a grammar will be called "corpus-complete", and a grammar
that is complete with respect to the language will be called
"language-complete." Only corpus-completeness will be used in
evaluating GCC-3. The grammar might then be tested against other
corpora for corpus-completeness. In this way, the language-complete
grammar would be approximated by the grammar which is corpus-complete
for several corpora, but it would never be known whether the grammar
actually attains language-completeness.


1/  Decisions  on  what  is  and  what  is  not  in  the  language  are  currently 
based  on  judgments  given  by  the  informant. 
2.   Tightness

A tight grammar for the language of Chinese characters is one which
generates only well-formed Chinese characters. Tight grammars are easy
to come by. A trivial example would be a "grammar" which lists some
small number of actually-occurring characters.

In a way analogous to the completeness situation, we may speak of
corpus-tightness and language-tightness. If a GCC is corpus-tight,
then it generates only those characters in the target corpus. It
turns out that any serious GCC, i.e., one which attempts to reveal
the internal structure of the characters (and does not simply list
characters), is almost bound not to be corpus-tight. This is an
empirical observation, and an analogous observation probably holds for
any serious attempt at grammar construction for natural language corpora.
Only language-tightness is used in this evaluation.

Lack of completeness implies insufficient output, or
under-generation. Lack of tightness implies too much output, or
over-generation. We have discussed over-generation of output characters
so far, but there is another type of over-generation, and that is the
over-generation of derivations per character.
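
The two criteria can be stated compactly with sets. The Python sketch below is an illustration only: it assumes that the grammar's output, the target corpus, and the language can each be listed as a finite set, which is a simplification (the language in particular is only known through the informant's judgments). The single-letter "characters" and the function names are hypothetical.

```python
def completeness(generated, corpus):
    """Fraction of the corpus that the grammar generates."""
    return len(corpus & generated) / len(corpus)


def under_generated(generated, corpus):
    """Corpus characters the grammar fails to generate."""
    return corpus - generated


def over_generated(generated, language):
    """Generated objects that are not well-formed characters."""
    return generated - language


# Toy data: letters stand in for characters.
corpus = {"a", "b", "c", "d"}
language = corpus | {"e"}          # well-formed but absent from the corpus
generated = {"a", "b", "c", "e", "z"}

corpus_complete = corpus <= generated      # False: "d" is missing
corpus_tight = generated <= corpus         # False: "e" and "z" are extra
language_tight = generated <= language     # False: "z" is not well-formed
```

In this toy case the output "e" counts against corpus-tightness but not against language-tightness, which is the distinction drawn in the text.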

An output object from any grammar is said to be ambiguous if the
grammar can generate it in more than one way, i.e., provides more than
one derivation of it. The problem of whether or not there are ambiguous
characters, or of what would be meant by "ambiguous character," has by
no means been solved. However, it seems that every character in the
language is uniquely segmentable into components by the informant, and
therefore our current position is that there are no ambiguous characters
in the language of Chinese characters. Consequently, a grammar that
generates any character in more than one way "over-generates".

B.    EVALUATION  OF  GCC-3 
1.   Completeness 

GCC-3 appears capable of accounting for 94% of the acceptable
characters in Mathews'. 1/ The characters that GCC-3 cannot account
for fall into seven known small classes. 2/ The grammar could be
adjusted so as to account for these classes, but only with a great loss
in tightness and/or simplicity.

First, there are those characters which contain more than five
components. Since the state diagram in the grammar sets a maximum of
five components per character, such characters cannot be accounted for.
Examples are: [two characters, each followed by a list of its six
components; glyphs not legible in this copy].

1/ To arrive at an indication of how complete GCC-3 is, we conducted
the following test. We selected (via a random number table) 200
characters from Mathews'. We then attempted to generate each character
from GCC-3. We found that GCC-3 could generate 188 of the 200, some
in more than one way. (See the discussion of ambiguity on pages 56-62.)
2/ With the exclusion of possible clerical errors.
Second, there are those characters which are (or which contain
sub-parts which are) complex in both NORTH and SOUTH (represented by
the frame [diagram not legible]) or in both EAST and WEST (represented
by the frame [diagram not legible]). Examples of the first frame are:
[glyph] ([glyph] in construction with [glyph]) and [glyph] ([glyph] in
construction with [glyph]). Examples of the second frame are [glyph]
([glyph] in construction with [glyph]) and [glyph] ([glyph] in
construction with [glyph]).


Third, there are those characters which contain weak components
in strong positions: either a weak component in construction with a
weak component or a weak component in construction with a complex
character sub-part. Examples of characters having a weak component in
construction with a weak component are: [glyph], where [glyph] and
[glyph] are in construction but are both weak, and [glyph], where
[glyph] and [glyph] are in construction but are both weak. Examples of
characters in which there is a weak component in construction with a
complex character sub-part are: [glyph], where [glyph] is weak but in
construction with [glyph], and [glyph], where [glyph] is weak but in
construction with [glyph].

Fourth, there are those characters which contain more than two
BORDERS. Since the state diagram prohibits any output character
from having more than two BORDERS, such characters cannot be accounted
for by the grammar. Examples are [glyph], which contains the three
BORDERS [glyph], [glyph], and [glyph], and [glyph], in which there are
the three BORDERS [glyph], [glyph], and [glyph].

Fifth, there are those characters in which there is a BORDER in
construction with a non-FREE component. Since in the grammar the
set of INTERIORS (those components which are in construction with
BORDERS) forms a proper subset of the set of FREES, BORDER plus
non-FREE structures cannot be accounted for by the grammar. The only
known examples are [glyph], in which the INTERIOR [glyph] (in
construction with BORDER [glyph]) is an adjunct, and [glyph], in which
the INTERIOR [glyph] (in construction with BORDER [glyph]) is an
adjunct.

Sixth, there are those characters which contain the form [glyph].
(This form is not to be confused with [glyph], a component in the
lexicon, and for structural reasons cannot be considered simply a
shortened variant of [glyph].) There are doubts as to whether this
form should be considered a component. If it were, it would seriously
affect the tightness of the grammar. Therefore, all that can be said
is that there are characters which contain it and cannot be accounted
for by the grammar because the grammar does not list it as a component.
Examples are [glyph], where the form is between [glyph] and [glyph],
and [glyph], where the form is between [glyph] and [glyph].

Seventh, there are a few characters which, if segmented into
components, would yield extremely irregular frames. 1/ Examples are:
[glyph], whose frame would be something like [diagram not legible], and
[glyph], whose frame would be something like [diagram not legible].
Since the only frames in the grammar are [three frame diagrams, not
legible in this copy] or results of combining these, characters
segmentable into frames like those above are not accountable for by the
grammar.

1/ These constitute a type of component superimposition. Component
superimposition is discussed in Rankin (4). However, the type of
superimposition discussed there does not include these characters.

There are two final observations. First, some characters which
cannot be accounted for by the grammar fall into more than one of the
classes discussed above. Second, some characters which appear to fall
into one of the above classes are actually accounted for by the grammar.
This is so because there is always the option of listing as a single
component those complex sub-parts which give trouble. For example,
the complex sub-part [glyph] is listed as a component, even though it
might fall into class seven above, and the complex sub-part [glyph]
is listed as a component even though it might be segmented into [glyph]
plus [glyph], which segmentation would cause characters containing
[glyph] to fall into class three above. Generally, decisions on
segmentation are motivated by three considerations: (1) a structure is
segmented into two components if the two components either must not
touch or need not touch; (2) a structure is segmented into two
components if the segmentation generally corresponds to one of the three
recognized frame types [three frame diagrams, not legible in this copy];
(3) a structure is segmented into two components if each sub-structure
can occur in different environments. 1/


2.        Tightness 


A.   Over-generation  of  output  characters 


Rather than sampling random output with a view to assigning a
tightness percentage, we feel it is more to the point to sub-classify
the unacceptable output on linguistic grounds. The reason is that
potentially there exist infinitely many grammars equivalent to GCC-3 in
the sense that they all generate the same language, all having different
tightness percentages associated with them. For example, we could
construct a GCC-4 from GCC-3 by substituting the following set of
transitions for the bottom-most transition which produces the single
symbol G: [transition diagram not legible in this copy; it introduces
the symbols G1, G2, G3, and G4].

1/ For a previous and more detailed discussion of segmentation, see
Rankin (4), chapter 4.

Then there would be added to GCC-3 rules by which G1 would
ultimately determine the subclass of the current FREE class whose
members contain, e.g., fewer than five strokes; G2, G3, and G4 would
respectively determine subclasses containing 5-to-10 strokes, 11-to-15
strokes, and more than 15 strokes. GCC-4 would generate precisely the
same language as that generated by GCC-3, but random output from GCC-4
would contain a higher percentage of acceptable characters than random
output from GCC-3. This is so because GCC-4 has a higher probability
of generating single-component characters, and all of these are by
definition acceptable.
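
The arithmetic behind this argument can be made concrete with a small Python sketch. All numbers in it are invented for illustration: it assumes that random output is produced by choosing uniformly among the transitions leaving one state, that GCC-3 has one single-component transition and four others there, that GCC-4 replaces the one with four, and that half of all multi-component output is acceptable. None of these figures come from the report; the function name acceptable_fraction is hypothetical.

```python
from fractions import Fraction


def acceptable_fraction(single_transitions, other_transitions, q_multi):
    """Expected share of acceptable output under uniform transition choice.

    single_transitions: transitions yielding single-component characters,
        all of which are acceptable by definition.
    other_transitions: transitions yielding multi-component characters.
    q_multi: assumed share of multi-component output that is acceptable.
    """
    total = single_transitions + other_transitions
    p_single = Fraction(single_transitions, total)
    return p_single + (1 - p_single) * q_multi


q = Fraction(1, 2)                       # assumed, not measured
gcc3 = acceptable_fraction(1, 4, q)      # one transition producing G
gcc4 = acceptable_fraction(4, 4, q)      # G replaced by G1, G2, G3, G4
print(gcc3, gcc4)
```

Both hypothetical grammars generate the same set of characters, yet the second reports a higher tightness percentage, which is why a percentage obtained by random sampling says little about the language generated.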


There are three classes of unacceptable output. The first class
of unacceptable output contains characters of excessive size. Here,
there are three sub-classes. First, there is a sub-class which contains
characters of excessive overall size. Examples are (1) 1/ and (2).
The problem with these output characters is that the frame for each has
the maximum number of sub-frames, and each sub-frame is occupied by a
component which is itself very large. Second, there is a sub-class
containing characters which are excessively horizontal. Examples are
(3) and (4). The problem here is that the frames for these characters
are of the maximum horizontality allowed by the grammar, and the
sub-frames are filled with "very horizontal" components. Third, there is
a corresponding excessively vertical sub-class. Examples are (5) and (6).
The frames for these characters are of the maximum verticality allowed
by the grammar, and the sub-frames are filled with "very vertical"
components.

1/ This output character and the following five (where there are
numbers in parentheses) are found in Appendix D.
The second class of unacceptable output contains characters in
which there are co-occurrence problems. That is, it is not the case
that, say, every WEST can occur with every EAST (even given the strength
and adjunctiveness constraints), and so on for other pairs of classes.
The one pair of classes which gives rise to the greatest number of
co-occurrence problems is the BORDER-INTERIOR pair, but there are
co-occurrence problems for every pair. Examples are [glyph], [glyph],
and [glyph], in which [glyph] and [glyph], [glyph] and [glyph], and
[glyph] and [glyph], respectively, simply should not co-occur.

Finally, there is a class of unacceptable output characters which
contain certain components in strong positions, these components
being only moderately strong. There seems, in fact, to be a scale of
strength. Among WESTS, for example, the scale ranges from, say,
[glyph] (very strong) through [glyph] (moderately strong) to [glyph]
(weak). Currently in the grammar, the class of strong components
includes very strong and moderately strong components. This blurring of
the scale of strength gives rise to such unacceptable output characters
as [glyphs not legible in this copy], where [glyph] and [glyph] are only
moderately strong ([glyph] and [glyph] being weak).
B.   Over-generation  of  Derivations 

There are three types of ambiguous characters. Actually, ambiguity
is not apparent at the output character level, but only at the
post-transducer level. This is because our current output characters
contain parts of their derivations, namely their frames. For example,
at the output character level, the following are distinct: [two framed
output characters, not legible in this copy]. However, at the
post-transducer level both would be realized as [glyph], which is for
that reason ambiguous. In this treatment, we will have to discuss
ambiguity as if we already had the output transducer referred to on
page iv, and we will use the term "character" to mean "post-transducer
character."

First, some characters which are (or contain) horizontal arrays
of three or more components are ambiguous with respect to the grouping
of components. For example, [character] has two derivations, one of
which imposes a [part] plus [part] grouping, the other of which imposes
a [part] plus [part] grouping. Then some of the characters which are
(or contain) vertical arrays of three or more components are ambiguous.
The sub-part [sub-part] of the character [character] has two
derivations, one of which imposes a [part] plus [part] grouping, the
other of which imposes a [part] plus [part] grouping.
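
The grouping ambiguity just described can be illustrated with a short
Python sketch. It assumes, as a simplification, that every grouping is
binary and that components are represented by plain strings; the names
"a", "b", and "c" are hypothetical and stand for any three components
in a horizontal or vertical array.

```python
def groupings(components):
    """Return every binary grouping of a sequence as nested tuples."""
    items = tuple(components)
    if len(items) == 1:
        return [items[0]]
    result = []
    # Try every place where the array can be cut into two parts.
    for split in range(1, len(items)):
        for left in groupings(items[:split]):
            for right in groupings(items[split:]):
                result.append((left, right))
    return result


def flatten(tree):
    """Discard the grouping and keep only the left-to-right components."""
    if isinstance(tree, tuple):
        return [leaf for sub in tree for leaf in flatten(sub)]
    return [tree]


three = groupings(["a", "b", "c"])
for grouping in three:
    print(grouping, "->", flatten(grouping))
```

An array of three components has two such groupings, and both flatten
to the same sequence, which is the sense in which the realized
character is ambiguous.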

This type of ambiguity is introduced in the first step of the
process that generates Stage 2, that is, in the selection of frames
from the table of correspondence. It is possible for two distinct
sequences of frames (which are generated by the same output string)
ultimately to determine the same character. This is shown in the
following two derivations of [character]. In these, and the derivations
to follow, the single arrow → is used for "directly determine" and
the double arrow ⇒ for "ultimately determine".

[Figure: two derivations of the same character, each ending at the
Output Transducer.]

Second, there are those characters which are themselves (or
which contain components which are) further segmentable into components.
For example, [component] is listed in the lexicon as a single component
because of its frequency of occurrence (specifically, it is a strong
EAST). It is also the case that [component] can be generated by the
grammar as [component] in construction with [component].

This type of ambiguity is introduced at Stage 1, where it is
possible for two distinct output strings from the state diagram
ultimately to determine the same character.

[Figure: two derivations, from distinct output strings, each ending at
the Output Transducer.]

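This second type of ambiguity can be sketched in Python under a strong
simplification: the realization of a component is treated as a flat
sequence of parts, ignoring frames and two-dimensional placement. The
lexicon entries "p", "q", and "pq" are hypothetical; "pq" stands for a
frequent component which is listed as a unit although it could also be
built from "p" and "q".

```python
# Hypothetical lexicon: component name -> the parts it is realized as.
LEXICON = {
    "p": ("p",),
    "q": ("q",),
    "pq": ("p", "q"),  # listed as a single unit because it is frequent
}


def derivations(target, lexicon=LEXICON):
    """Return every sequence of lexicon entries that realizes target."""
    target = tuple(target)
    if not target:
        return [()]
    found = []
    for name, shape in lexicon.items():
        # An entry is usable if its parts match the start of the target.
        if target[:len(shape)] == shape:
            for rest in derivations(target[len(shape):], lexicon):
                found.append((name,) + rest)
    return found


both = derivations(("p", "q"))
print(both)
```

Two distinct strings of components are found for the one realized
form: the two smaller entries in construction, and the single listed
unit.
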
Finally, some characters are (or contain sub-parts which are)
analyzable in more than one way, but into the same frame. An example
is [character], which is analyzable as either [analysis] or [analysis].
So [character] in isolation is ambiguous, and any character which
contains [character] as a sub-part is ambiguous.

This type of ambiguity is introduced (as in the first type
discussed here) in the frame selection process. However, unlike the
first type, these characters have identical frames at the output
character level.

[Figure: two derivations of the same character, with identical frames,
each ending at the Output Transducer.]

APPENDIX A
Classes of Mathews' entries which are deliberately excluded from the
corpus (Mathews' sometimes lists several semantically related entries
under one number. The number in parentheses indicates which of these
entries is referred to.)

The following classification of excluded Mathews' entries arises
out of a first approximation to a filtering process aimed at excluding
entries which are not totally well-formed. We selected these particular
ones in the following way. First, Mathews' lists all of these
characters as secondary or tertiary variants. Second, the informant
considers most of them to be less than acceptable. Third, they do not
fit well into the current grammatical framework. This is by no means
a definitive listing or classification.


64 


1.   "Radicals" which are cited in Mathews' as separate entries
but which do not occur in isolation in everyday usage. 1/

 240  (3)      1439  (2)
1282  (2)      1650  (2)
1373  (2)      2735  (2)
2735  (3)      5788  (3)
2989  (2)      5838  (2)
3037  (2)      5922  (2)
3097  (2)      6124  (2)
3153  (2)      6739  (3)
4737  (2)      7666  (2)
5570  (2)


1/  A  comparable  example  in  English  lexicography  is  the  listing  of  bound 
morphemes  (e.g.,  the  prefix  "pre-"),  which  do  not  occur  as  free  words. 
Many  Chinese  dictionaries  do  not  even  list  these  radicals  as  lexical 
entries. 

2.  Printed forms whose hand-written variants are acceptable. 1/

297    (2)   g  5283   (3)   ^ 

2309   (3)   jT)  5788   (2)    -jr 

3.  Abbreviations  whose  unabbreviated  forms  are  acceptable: 

560  (2)    J*L  3366   (3)   yjv. 

921  (2)    ^  3617   (2)    -J£] 

1082  (2)   ^y  3953   (2)    ^ 

1205  (2)    |g  6534   (2)    £ 

2222  (2)    ^  7030   (2)    7J 

2753  (2)   iQ_  7615   (2)    ^ 


1/ This list includes only those printed forms for which Mathews'
lists acceptable hand-written variants. In other cases, Mathews' lists
printed forms only (without their hand-written variants). We account
for these by copying into our lexicon only their hand-written variants.
For example, we list in our lexicon [form] for the Mathews' entry
[form], [form] for [form], [form] for [form], and [form] for [form].

4.   Pictorially bizarre variants of acceptable characters

 666  (2)      3883  (3)
1478  (2)      3953  (3)
1517  (3)      3992  (2)
2451  (2)      4083  (2)
2896  (2)      4464  (2)
3342  (2)      4725  (2)
3406  (3)      5603  (2)
5780  (2)      7044  (2)
6209  (2)      7519  (2)

APPENDIX B

1.  Variation Processes and List

Component variation has been extensively studied, but a final
statement on this phenomenon is not ready for presentation. What
follows is a statement of four general processes and a list of rather
more ad hoc variations. Component variation belongs properly in the
rules of the output transducer, which has not been constructed, and
is presented here mostly as an aid to human generation of output
characters.

Further, some of the general variation processes may not be
precisely stated here, and there are indications that some of the
instances of ad hoc variation should be grouped together into general
processes.

Variation Process 1 (BORDER)

For certain components whose base forms 1/ have [shape] or [shape]
in their "eastern" portions, this process causes the BORDER variant
to assume a shape wherein [shape] becomes [shape] and [shape] becomes
[shape]. Examples: 2.5.7: [base form] → [variant], and 12.8.1:
[base form] → [variant].

1/ A base form is that form of a component which can occur in isolation,
that is, a form which belongs to FREE.

Variation Process 2 (WEST)

For certain components which have [shape] in their "south central"
portions, this process causes the WEST variant to assume a shape
wherein [shape] becomes [shape]. Examples: 4.1.7: [base form] →
[variant], and 8.1.3: [base form] → [variant].

Variation Process 3 (WEST)

For certain components which contain [shape] or a [shape]-like
structure, this process causes the WEST variant to assume a shape
wherein [shape] becomes [shape]. Examples: 4.4.21: [base form] →
[variant], and 7.4.6: [base form] → [variant].

Variation Process 4 (BORDER)

For certain components which have [shape] or [shape] in their
"southeastern" portions, this process causes the BORDER variant to
assume a shape wherein [shape] or [shape] becomes [shape]. Examples:
5.4.5: [base form] → [variant], and 5.4.8: [base form] → [variant].


5.   Ad  hoc  variations 


Number      Base Form      Position      Variant Form


2.1.2 


2.2.3 


2.4.2 


2.4.4 


2.5.3 


2.8.14 


3.1.4 


3.1.6 


3.1.13 


3.2.2 


3.2.3 


3.2.7 


3.2.8 


3.3.1 


+ 

B 

r 

X. 

N 

X 

/v 

N,S 

/v 

A 

I    (sometimes) 

A 

7b 

B 

71) 

f- 

W 

1 

a 

B 

a 

-^- 

sometimes 

^ 

+ 

B 

f 

-T- 

B 

*- 

4- 

I    (sometimes) 

f 

i^- 

S 

# 

t 

B 

-t 


3.4.1 


3.5.8 


f 

4.1.6  j=\ 

4.1.18  £} 

4.2.9  -41 

4.2.16  -i- 

4.3.11  ^ 

4.4.4  AT 


4.4.16 


4.4.21 


4.5.9 


4.5.11 


4.7.5 


4.8.10 


9 


z 



N  (sometimes) 

A. 

N  (sometimes) 

4* 

S 

n 

sometimes 

Q 

E 

t 

N,S 

4- 

B 

-x 

N 

A_ 

S 

X 

B 

*-_ 

S 

<<- 

S  (sometimes) 

* 

B 

J 

B 

'} 

B 

f- 

B 

71 



5.2.4 

w 

B 

$  P 

5.3.10 

/I 

B 

n 

5.4.13 

%. 

W 

% 

5.4.19  7v  N  ^ 

5.8.6  ;iti  B  ^     L 

"£  s  •£• 


6.1.5 


6.1.18 


6.2.3 


6.3.2 


6.3.17 


6.7.1 


8.2.7 


8.3.6 


4  b  *r 

w  J[ 

-f-  w  f- 


6.3.13  £  W  £ 


*|y 


8.1.6  m  N  *H 


9.3.3 


9.4.4 


11.3.1 


11.8.1 


12.3.2 


t 


bt. 


13.3.1 


M 


N 


W 


B 


N 


K 


y 


^ 


N 


ytti 

iSi 


13.7.2 


14.1.1 


£3 


/I 


B 


£3 

71 
APPENDIX C
The Lexicon

The lexicon is described on pages 34 through 39. There is, however,
one extra-grammatical type of information included in the lexicon
which must now be explained. There are superscripts (1-5) on some of
the tallies. 1/ The significance of these superscripts is that the
component on the row of this tally, when it occurs in the position
class specified by the column heading of the tally, undergoes a shape
variation. Superscripts 1-4 refer to variation processes 1-4
described in Appendix B; superscript 5 refers to the ad hoc list of
variants in Appendix B. The ad hoc list follows the order of
appearance of components in the lexicon.


1/ In two cases the component itself is superscripted. These are
cases where there is more or less free variation among the component
variants.

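The superscript convention can be read as a simple lookup, sketched
here in Python. The encoding of a tally as a mark followed by an
optional digit (for example "X5") is an assumption made for
illustration, as are the function names; only the meaning of
superscripts 1-5 and the position classes of the four processes are
taken from this appendix and Appendix B.

```python
# Meaning of each superscript, as stated in Appendices B and C.
VARIATION_BY_SUPERSCRIPT = {
    1: "Variation Process 1 (BORDER)",
    2: "Variation Process 2 (WEST)",
    3: "Variation Process 3 (WEST)",
    4: "Variation Process 4 (BORDER)",
    5: "ad hoc variant list",
}


def parse_tally(tally):
    """Split a tally such as 'X5' or 'S' into (mark, superscript or None)."""
    mark = tally.rstrip("0123456789")
    digits = tally[len(mark):]
    return mark, (int(digits) if digits else None)


def variation_for(tally):
    """Return the variation a tally calls for, or None if it has none."""
    _, superscript = parse_tally(tally)
    if superscript is None:
        return None
    return VARIATION_BY_SUPERSCRIPT[superscript]


print(variation_for("X2"))
print(variation_for("S"))
```

A tally without a superscript calls for no shape variation; a tally
with superscript 5 sends the reader to the ad hoc list rather than to
one of the four general processes.
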

1.1.1


1.7.1 


WENSBIFVCTD 


1.8.1 


1.8.2 


L 


1.8.3 


L 


1.8.4 


1 


2.1.1 

2.1.2 

2.1.3 

2.1.4 

2.2.1 

2.2.2 
2.2.3 

2.2.4 

2.2.5 

2.3.1 


H 


-t- 


i 


A 
2.3.2

r 

w 

E 

N 

S 

B           I 

F          V          C         T        D 

h 

S 

i 

s 

s 
s 

s 
1 

s 

s 
s 
s 

s 

s 

s 

s 

s 

5 

s 

s 

-5 
s 

1 

s 

s 

s 

I 

s 

5 
s 

s 

s 

s 
I 

s 

X 
X 

1 

X 

X 

X 
X 

X 

X 

x5 

X 
X 

X 

X 
X 
X 
X 
X 
X 
X 

X 
X 

X 

X 

X 
X 

X 

X 

2.3.3 

2.4.1 

2.4.2 

2.4.3 

2.4.4 

2.5.1 

2.5.2 

2.5.3 

2.5.4 

2.5.5 

2.5.6 

2.5.7 

2.6.1 

2.6.2 

2.7.1 

2.7.2 

i 

X 

A 

/\ 

Jl 

P 

X 

r 

V 

r 

^ 

<7 

X 

/ 

T 

1T"7 
2.7.3

2.7.4 

2.8.1 

2.8.2 

2.8.3 

2.8.4 

2.8.5 

2.8.6 

2.8.7 

2.8.8 

2.8.9 

2.8.10 

2.8.11 

2.8.12 

2.8.13 

2.8.14 

i 



W 

E 

N 

s 

B 

I 

F          V          C         T        D 

'J 

8 

S 
8 

s 

s 
s 

8 

8 

S 

8 
I 

8 
8 

S 

8 

S 

8 
I 

8 
8 

S 

8 
8 

8 

8 
8 
S 
8 
S 

1 

X 
X 

X 

X 

5 

X 

X 

X 

X 

X 

X 
X 
X 

X 
X 

X 
X 
X 
X 
X 

X 

X 
X 
X 
X 
X 

X 

1 

d 

/L 

i 

XL 

n 

;l. 

c 

5 

1 

1 

L, 

t 

-L 

/I? 
w

E 

N 

s 

B            I 

F          V          C         T        D 

3.1.1 

± 

2 

8 

2 

s 

5 

s 

s 
s 

a 

s 

s 
s 

s 
s 
s 

s 

s 

s 

s 
s 

8 

8 
8 
8 
8 

8 
S 

8 

8 
S 
8 

8 
8 

8 

8 

8 

I 

8 
8 

8 
8 
8 

5 

X 
X 

5 

X 

5 

X 

X 
X 
X 

X 

X 
X 

X 

X 

X 
X 
X 
X 

X 
X 

X 
X 
X 

X 

X 
X 

X 

X 
X 

X 

X 

X 

X 
X 
X 
X 

X 

X 
X 

3.1.2 
3.1.3 
3.1.4 
3.1.5 

X. 

J- 

f 

i 

3.1.6 

3.1.7 

3.1.8 

3.1.9 

3.1.10 

3.1.11 

3.1.12 

3.1.13 

3.1.14 

3.2.1 

3.2.2 

3.2.3 

a 

X 

\/ 



-E: 

A- 

-£ 

5 

-3- 

/ 

* 

f 

■f 
3.2.4

W 

E 

N 

S           B           I 

F          V          C          T        D 

"1 

s 

8 

X 

3.2.5 

F 

8 

S 

3.2.6 

<h 

8 

I 

8 

8 

X 

X 

3.2.7 

+ 

8 

5 

X 

X 

3.2.8 
3.2.9 
3.2.10 

■w 

S 

5 

8 

X 

X 
X 

Y 

i 

3.2.11 

\ 

8 

3.2.12 

3.3.1 

3.3.2 

t 

I 
S 

8 

X 

X5 

X 
X 

X 

X 

3.3.3 
3.3.4 

k 

I 

8 
S 

8 

8 

8 

X 

X 
X 

X 
X 

X 

V 

3.3.5 

& 

S 

S 

X 

X 

3.3.6 

?L 

S 

8 

X 

3.3.7 
3.3.8 

t 

s 
s 

8 

3 

a 

X 

X 
X 

/L 
3.3.9

3.3.10 

3.3.11 

3.3.12 

3.3.13 

3.3.14 

i 
3.3.15 

3.3.16 

3.3.17 

3.3.18 

3.4.1 

3.4.2 

3.4.3 

3.4.4 

3.4.5 

3.5.1 

3.5.2 

i 

w 

E 

N 

S           B          I 

F          V          C         T        D 

A 

8 
S 

8 
S 

8 
S 

s 
s 
a 

I 

8 
I 

8 

S 

8 

S 

8 
S 
8 

5 

8 

i 

s 

s 

a 

8 
8 
I 

X 

X 
X 

X 

X 
X 
X 

X 

X 

X 
X 

X 
X 
X 

X 

X 

X 

X 

X 
X 
X 

X 

X 

X 

1 

9 

(r 

T 

h 

i 1 

f 

/A 

1  Is 

A 

3t 

X. 

£_ 

*- 

t 

P 
WE  NS  BI  FVCTD


-5 

s 


81 


x  x 


3.8.5 


3.8.6 


3.8.7 


3.8.8 


3.8.9 


3.8.10 


3.8.11 


3.8.12 


3.8.13 


3.8.14 


3.8.15 


4.1.1 


4.1.2 


4.1.3 


4.1.4 


4.1.5 


e. 


-t, 


5 


<« 


n 


/L 


^ 


t 


yiL 


*L 


E7 


a 


A 


3 


■Sr 


WENSBIFVCTD 


S 
4.1.6

4.1.7 

4.1.8 

4.1.9 

4.1.10 

4.1.11 

4.1.12 

4.1.13 

4.1.14 

4.1.15 

4.1.16 

4.1.17 

4.1.18

4.1.19 

4.1.20 

4.2.1 

4.2.2 

w 

E 

N 

S           B           I 

F          V          C          T         D 

M 

s 

2 
s 

2 

s 

8 
S 

i 

8 
S 

s 
s 

8 

8 
8 

8 

I 

8 

8 
8 
8 
8 
S 

8 

8 

8 
8 

5 

8 

8 
8 

8 

8 
8 

X 

X 
X 
X 

X 

X 

X 

X 

X 
X 

X 
X 
X 

X 

X 
X 
X 
X 
X 
X 

X 
X 
X 
X 

X 

X 

X 
X 
X 

X 

X 
X 

X 
X 

X 

ah 

_L 

Jt_ 

iL 

JL 

± 

-y- 

JL 

-fr 

■ja- 

b 

Q5 

A 

#■ 

+ 

-t- 
i

w 

E 

N 
i

3.2.11 

\ 

3.6.1 

* 

3.3.12 

C- 

3.6.2 

EAST  only: 

.5.12  ^fe 


4.6.1 


5.3.16  ^ 


6.3.1 


7.5.2 


2.5.5  V  3.7.3  \— : ?  4.7.3 
207.3  i|

NORTH  only: 

2.1.1  /-  2.7.2  i—?  4.1.9  Jt£ 

2.1.3  .3-  2.8.9  ^7  4.2.14  +^ 

2.1.4  -1-  3.1.11  y^v  4.3.19  5"? 
2.3.3  7  3.3.9  Z;  4.5.6  <0 


K 


4.7.4  *±-,  7.7.1  Jk,  10.7.1  ^ 

5.4.23  /K.  8.7.1  jg^  11.1.7  {ffl 

6.3.20  A/:  8.7.3  %xx,  13.7.1  ^^ 

6.4.8  /^  8.7.4  ^L,  15.7.1  ffi? 

6.7.2  ^  9.7.1  ^^ 


SOUTH  only: 


3.8.14 

;iu 

4.3.18 

ni\ 

6.4.12 

* 

BORDER  only: 

1.8.3 

L 

3.3.16 

f 

5.6.1 

1.8.4 

1 

3.4.4 

JL. 

6.3.8 

n 

2.2.4 

u 

3.5.7 

1 

r 

6.5.4 

r 

2.5.4 

/ 

3.8.10 

h 

6.8.7 

A 

2.5.6 

r 

4.1.17 

Q 

7.5.3 

/ 

2.8.7 

c 

4.4.18 

i_ 

7.8.2 

* 

3.1.14 

/■ 

4.5.10 

> 

8.3.12 

*s 

3.2.12 

r 

4.8.7 

^ 

12.1.12 

a  o 

o  a 
FREE only:

1.7.1  J 

3.2.9  Y  5.1.24        £b 


3.4.5 

1 

5.3.13 

V 

3.6.3 

1 

5.8.7 

m, 

3.7.2 

t 

6.1.13 

n 

4.1.1 

E? 

6.2.8 

% 

4.1.20 

# 

6.3.11 

1 

4.2.12 

ft 

6.3.12 

4.4.19 

^ 

6.3.14 

4.8.8 


7.3.10 

jf> 

8.3.11 

IL 

8.4.3 

7K 

8.8.2 

~$f- 

10.3.4 


11.1.4 


11.5.1 


14.2.1 


16.3.1 


4thL 
ft 


mi 

-^  6.7.3  ^  17.3.2  Ml 
APPENDIX D

[Figure: the numbered output characters cited in the text.]